We will use the classic diamonds dataset in today’s exercise (available as part of data packages in both R and Python, and on Kaggle). You can find information about each of the variables below.
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)
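The depth formula above can be checked against the measured dimensions. The sketch below assumes the column names listed above and multiplies the ratio by 100 so that it lands in the 43–79 range; the helper name `depth_percent` and the sample measurements are made up for illustration. Because x, y, and z all have a minimum of 0, rows with no recorded dimensions are returned as missing rather than dividing by zero.

```python
import numpy as np
import pandas as pd


def depth_percent(df: pd.DataFrame) -> pd.Series:
    """Recompute total depth percentage as 100 * 2 * z / (x + y).

    Hypothetical helper: rows where x + y is 0 (dimensions not recorded)
    yield NaN instead of a division by zero.
    """
    denom = (df["x"] + df["y"]).replace(0, np.nan)
    return 100 * 2 * df["z"] / denom


# Made-up measurements for illustration, not rows from the dataset.
sample = pd.DataFrame({"x": [4.0, 0.0], "y": [4.0, 0.0], "z": [2.5, 0.0]})
print(depth_percent(sample))
```

Comparing `depth_percent(df)` with the recorded `depth` column is one way to spot rows whose dimensions look unreliable.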
Instructions
Use describe() and sns.pairplot() to examine overall trends in the data. Describe what you observe using markdown text.
Which variables seem to influence each other? Make an informed guess about the direction of causality between these variables (you can refer to the Wikipedia page on Diamonds if you find it helpful).
Choose any set of appropriate variables to generate each of the following plots. You can refer to the seaborn gallery for examples.
Scatter plot with a single pair of variables
Scatter plot of two variables with a third variable encoded using color
Bar plot with two variables
Grouped bar plot of two variables with a third variable encoded using color
Facet grid showing multiple levels of an ordinal variable on one axis
Violin plot showing a single continuous numeric variable on the y axis and a categorical variable on the x axis
Answers
Your text and code go here! Use markdown to nicely format your plots and answers.
# import libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# import dataset
df = pd.read_csv('./../../datasets/diamonds.csv')
df
- The 'carat' column has a low standard deviation relative to its range (0.2–5.01), indicating that most of the observed diamonds have similar weights.
- In contrast, price has a very high standard deviation and a large gap between its minimum and maximum, indicating that the observed prices vary widely.
- The scatterplot of carat against length (x) curves upward: weight increases with length, and increasingly steeply.
- Width (y) and depth (z) show a similar pattern, with a sharper curve.
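The observations above can be backed with numbers. The sketch below is one possible way to do that, not part of the original analysis: `relative_spread` divides each standard deviation by its mean so that carat and price can be compared on the same footing, and `loglog_slope` estimates how steeply one variable grows with another. Both helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def relative_spread(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Mean, standard deviation, and their ratio (coefficient of variation)."""
    stats = df[cols].agg(["mean", "std"]).T
    stats["cv"] = stats["std"] / stats["mean"]
    return stats


def loglog_slope(df: pd.DataFrame, x_col: str, y_col: str) -> float:
    """Slope of log(y_col) against log(x_col), skipping non-positive values."""
    valid = df[(df[x_col] > 0) & (df[y_col] > 0)]
    slope, _intercept = np.polyfit(np.log(valid[x_col]), np.log(valid[y_col]), 1)
    return float(slope)
```

For example, `relative_spread(df, ["carat", "price"])` puts the two spreads side by side, and `loglog_slope(df, "x", "carat")` gives the exponent of the curve. If weight scales with volume, a slope near 3 would be expected, which would account for the upward curve.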
Variables that seem to influence each other:
- price by depth
- price by cut
- weight and length
- carat by dimensions (x, y, and z)
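A quick numeric check on the pairs listed above is to rank the numeric columns by how strongly they move with price. This is a minimal sketch under a few assumptions: it uses Spearman rank correlation because the relationships look curved rather than linear, the helper name `rank_correlations` is hypothetical, and the text columns (cut, color, clarity) are left out unless they are encoded as numbers first. A correlation says nothing about the direction of causality.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Spearman rank correlation of every numeric column with `target`."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr(method="spearman")[target].drop(target)
    # Strongest relationships first, regardless of sign.
    return corr.sort_values(key=abs, ascending=False)
```

Calling `rank_correlations(df)` returns one value per numeric column, with values near 1 or -1 indicating a strong monotonic relationship with price.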
Generating plots
Scatterplot of length (x) and carat
g1 = sns.scatterplot(data=df, x='x', y='carat')
plt.title("Length vs Carat")
plt.show()
Scatter plot of two variables with a third variable encoded using color
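One possible version of this plot is sketched below. The choice of variables (price against carat, colored by cut) and the helper name `scatter_with_hue` are assumptions made for illustration; any two numeric columns and one categorical column would work the same way.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def scatter_with_hue(df: pd.DataFrame, x: str = "carat", y: str = "price",
                     hue: str = "cut") -> plt.Axes:
    """Scatter plot of y against x with a third variable mapped to color."""
    fig, ax = plt.subplots(figsize=(7, 5))
    # Small, semi-transparent points keep dense regions readable.
    sns.scatterplot(data=df, x=x, y=y, hue=hue, alpha=0.5, s=10, ax=ax)
    ax.set_title(f"{y} vs {x}, colored by {hue}")
    return ax
```

Calling `scatter_with_hue(df)` followed by `plt.show()` draws the figure.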
Choose two plots from the seaborn gallery that we haven’t already used and which are relevant to your EDA. Re-create them here using the diamonds dataset.
Explain why these are appropriate plots for this type of data. What trends or insights are visible?
This plot also uses the diamonds dataset, and I wanted to see what it would look like without the log scaling. Log scaling makes the plot more readable, but the linear scaling is more intuitive to me.
Based on this plot, it looks like most of the prices in this dataset are in the $500–$1,500 range. This may mean that most diamonds sell in this range, but confirming that would take more analysis. This range is also where the majority of the “Ideal” cut diamonds fall.
This is another plot that looks very different with log scaling. Without log scaling, there are so many outlier points at the upper end of the price range that the plot is hard to read. On both plots, the median price for each cut is roughly the same (about $2,500). Something I don’t really understand is why the upper whisker (the horizontal line at the top of each box plot) looks the same across cuts on the log-scale plot, but different on the linear-scale plot.
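One possible explanation for the whisker question, offered as an assumption rather than a confirmed diagnosis: a box plot's upper whisker ends at the largest observation no higher than Q3 + 1.5 × IQR. If the log-scale plot computes that limit on log-transformed prices, the limit sits much higher relative to the data, so the whisker can reach the most expensive diamond in every cut. The sketch below computes both versions so they can be compared; the helper names are hypothetical, and whether the plotting call actually transforms the data before computing the statistics depends on how the log scale was applied.

```python
import numpy as np
import pandas as pd


def upper_whisker(values: pd.Series) -> float:
    """Largest observation at or below Q3 + 1.5 * IQR (Tukey's rule)."""
    q1, q3 = values.quantile([0.25, 0.75])
    fence = q3 + 1.5 * (q3 - q1)
    return values[values <= fence].max()


def whiskers_by_group(df: pd.DataFrame, group: str = "cut",
                      value: str = "price") -> pd.DataFrame:
    """Upper whisker per group, computed on raw and on log10 values."""
    grouped = df.groupby(group, observed=True)[value]
    linear = grouped.apply(upper_whisker)
    # Compute the whisker in log10 space, then convert back to price units.
    log = grouped.apply(lambda s: 10 ** upper_whisker(np.log10(s)))
    return pd.DataFrame({"linear_scale": linear, "log_scale": log})
```

If `whiskers_by_group(df)` shows the `log_scale` column sitting at or near the maximum price for every cut while the `linear_scale` column varies, that would be consistent with what the two plots show.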